rewrite: Pythonic, ocrd v3, utilise page-level annotation #28
base: master
Thank you very much, @bertsky.
…leGrp for images, no parsing/validation/repair)
…add pytest option --workspace for subsets, determine input fileGrp automatically, download and process up to 4 random pages only, test PAGE2PDF and ALTO2PDF, depending on whether PAGE or ALTO is in the workspace
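The test strategy described in this commit (process only a few random pages, and choose the converter by what the workspace contains) could be sketched as follows. This is a minimal illustration, not the PR's actual test code: the helper names `sample_pages` and `detect_format` are hypothetical, and detecting the format from the XML root namespace is an assumption (the real tests may well inspect the fileGrp instead).

```python
import random
import xml.etree.ElementTree as ET

# Namespace prefixes of the two formats (the version suffix varies).
PAGE_NS_PREFIX = "http://schema.primaresearch.org/PAGE/gts/pagecontent/"
ALTO_NS_PREFIX = "http://www.loc.gov/standards/alto/"


def sample_pages(page_ids, limit=4, seed=None):
    """Pick at most `limit` random pages (hypothetical helper)."""
    rng = random.Random(seed)
    page_ids = list(page_ids)
    return sorted(rng.sample(page_ids, min(limit, len(page_ids))))


def detect_format(xml_text):
    """Return "PAGE", "ALTO" or None, based on the root element's namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree renders namespaced tags as "{namespace}localname".
    namespace = root.tag[1:].split("}")[0] if root.tag.startswith("{") else ""
    if namespace.startswith(PAGE_NS_PREFIX):
        return "PAGE"
    if namespace.startswith(ALTO_NS_PREFIX):
        return "ALTO"
    return None
```

A test could then call `sample_pages` on the workspace's page IDs and run PAGE2PDF or ALTO2PDF depending on what `detect_format` returns for the first input file.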
next up:
Notice that this also supports things like …
I wonder if the multipage file ID should really be specified manually. It may be difficult to come up with a non-conflicting name in a scripted setting. In contrast, the tool itself could try with …
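To make the concern about non-conflicting names concrete, here is one way a tool could derive a free file ID on its own. It is only a sketch under assumptions: the function name `unique_file_id`, the numeric-suffix scheme and the example ID `PDF_multipage` are all hypothetical, and this is not necessarily the scheme proposed in the comment above.

```python
def unique_file_id(base, existing_ids):
    """Return `base` if unused, else the first free `base_N` (N = 2, 3, ...)."""
    existing = set(existing_ids)
    if base not in existing:
        return base
    counter = 2
    while f"{base}_{counter}" in existing:
        counter += 1
    return f"{base}_{counter}"


# Hypothetical IDs already present in the METS:
taken = ["PDF_multipage", "PDF_multipage_2"]
print(unique_file_id("PDF_multipage", taken))
```

With such a fallback, a manually specified ID could remain optional for scripted runs.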
Thanks @bertsky for all your work. We still have to decide how we want to proceed with the repository in general. I hope that we will have made a decision by the end of the week.
done. I have also added continuous deployment. You would need to add the following in the repo settings to make everything work:
What do you mean?
We are talking with the OCR-D coordination team about moving this repository to https://github.com/OCR-D/.
I see. If and when that's certain, please let me know so I can adapt the upstream URLs (packaging, CI+CD) before this is merged.
Edited: We have decided to move the repository to https://github.com/OCR-D/. @kba should now be able to carry out the transfer at the right time.
done: 7021614
First attempt at a full OCR-D processor for this. Builds on core 3.0 – which brings error handling and page parallelism. (For that we require Python instead of bashlib. But Python is already much faster sequentially.)
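For readers unfamiliar with the idea, page parallelism with per-page error handling can be illustrated generically with the standard library. This is not how core 3.0 implements it (its mechanism may differ); `process_page`, `process_pages` and the page IDs are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_page(page_id):
    """Stand-in for per-page conversion; fails for one page on purpose."""
    if page_id == "PHYS_0003":
        raise ValueError(f"cannot convert {page_id}")
    return f"{page_id}.pdf"


def process_pages(page_ids, max_workers=4):
    """Run pages concurrently; a failing page does not abort the others."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_page, pid): pid for pid in page_ids}
        for future in as_completed(futures):
            pid = futures[future]
            try:
                results[pid] = future.result()
            except Exception as exc:  # record the error and continue
                failures[pid] = str(exc)
    return results, failures


results, failures = process_pages(["PHYS_0001", "PHYS_0002", "PHYS_0003"])
```

The point of the pattern is that each page succeeds or fails independently, so one broken page yields an error record rather than a failed run.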
…clipped): it does not suffice to pass the original image and PAGE file name to the converter; instead, one needs to extract/generate the derived image for the page, and then transform all coordinates of the PAGE accordingly. This is borrowed from ocrd-segment-replace-original.
get_metadata (with its XML tree operations) needs to read from a filesystem copy of the METS instead of the ClientSideOcrdMets.
…multipagepdf code, but separated the functions.
negative2zero was too simplistic. In OCR-D, we have the PageValidator against all kinds of coordinate invalidities and inconsistencies. I borrowed from ocrd-segment-repair for actual repairs (although this is debatable, since repair should arguably remain a separate step; also, I had to copy and paste a lot of polygon handling code). However, there still seems to be a problem with the coordinates of the outlines...
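The coordinate issue described above can be sketched in plain Python: once a derived (for example cropped) image replaces the original, every PAGE polygon has to go through the same affine transform, and the result may leave the image on any side, not only into negative values. This is an illustration under assumptions, not the PR's code; the helper names, the crop offset and the image size are hypothetical.

```python
def transform_points(points, matrix):
    """Apply a 3x3 affine matrix (row-major, homogeneous) to (x, y) points."""
    out = []
    for x, y in points:
        new_x = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
        new_y = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
        out.append((new_x, new_y))
    return out


def out_of_bounds(points, width, height):
    """Return the points lying outside the rectangle [0, width] x [0, height]."""
    return [(x, y) for x, y in points
            if x < 0 or y < 0 or x > width or y > height]


# Hypothetical case: the derived image is a 400x300 crop taken at offset
# (100, 50), so original coordinates shift by (-100, -50).
crop_shift = [[1, 0, -100],
              [0, 1, -50],
              [0, 0, 1]]
polygon = [(90, 60), (520, 60), (300, 200)]

shifted = transform_points(polygon, crop_shift)
invalid = out_of_bounds(shifted, width=400, height=300)
```

Here `invalid` contains one point with a negative x and one beyond the right edge; merely setting negative values to zero would leave the second one untouched, which is one reason a fuller validation and repair step is needed.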
Next I'll add further improvements:
…binarized or not)
…font param as installable resmgr resources
This depends on OCR-D/core#1305.